1. Overall Machine Learning Picture
  2. Cloud Migration Solution (Amazon Web Services)
  3. Classification example
  4. Clustering example
  5. Prediction of AAPL stock price - Regression
  6. Prediction of Kmart sales - Classification

1. Overall Machine Learning Picture (Python scikit-learn libraries)

2. Cloud Migration Solution (Amazon Web Services)

3. Classification Example

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
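The aihub.example_classification() call below is in-house and not reproduced here. As a rough, self-contained sketch of the same idea with scikit-learn (the dataset and model are illustrative choices, not necessarily what the hub uses):

```python
# Classification sketch: fit a decision tree on the iris dataset and
# check how well it labels held-out observations of known category.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```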

In [52]:
from aihub import aihub
import matplotlib

font = {'family' : 'sans-serif',  # 'normal' is not a valid font family
        'weight' : 'regular',
        'size'   : 12}

matplotlib.rc('font', **font)

aihub.example_classification()

4. Clustering Example

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining.
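The aihub.example_cluster() call below is in-house; a minimal clustering sketch with scikit-learn's k-means (synthetic data, illustrative only):

```python
# Clustering sketch: group synthetic 2-D points into 3 clusters with
# k-means and inspect how many points land in each cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```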

In [2]:
aihub.example_cluster()

5. Prediction of AAPL stock price - Regression

  • Regression: In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables.

  • Lasso Regression: In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.

  • Support Vector Machines: In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
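The three methods above can be compared side by side on the same data. A minimal sketch with scikit-learn on synthetic data (the models and parameters are illustrative, not the hub's configuration):

```python
# Regression sketch: fit ordinary least squares, Lasso, and a linear
# SVR on the same synthetic data and compare held-out R^2 scores.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("reg", LinearRegression()),
                    ("lasso", Lasso(alpha=1.0)),
                    ("svr_linear", SVR(kernel="linear"))]:
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```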

a) Import libraries (hub libraries are developed in-house)

In [3]:
import warnings
import pandas as pd
import matplotlib 
import os

os.chdir("d:/machine-learning/anahub")

# developed by Usman Ahmad
from datahub import datahub
from auxhub import auxhub
from edahub import edahub
from algohub import algohub
from aihub import aihub

b) Load AAPL stock price data (source: Google Finance)

In [4]:
dh = datahub()
aapl_file = "d:/data/NASDAQ_AAPL.txt"
dh.add_dataset("aapl", aapl_file)
dh.get_dataset("aapl").head()
Instantiating Data Hub class
Out[4]:
Ticker Date Open High Low Close Volume
0 AAPL 2.010100e+11 295.01 295.05 294.82 294.82 5235
1 AAPL 2.010100e+11 294.81 294.90 294.80 294.85 7441
2 AAPL 2.010100e+11 294.85 294.98 294.84 294.85 4268
3 AAPL 2.010100e+11 294.83 294.83 294.75 294.83 4012
4 AAPL 2.010100e+11 294.80 294.82 294.64 294.67 13081

c) Exploratory Data Analysis

In machine learning and statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
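edahub.init_analysis below is an in-house wrapper; comparable summary output can be produced with plain pandas, roughly as follows (the toy frame mirrors a few of the AAPL columns above):

```python
# EDA sketch: summary statistics, dtype counts, and pairwise
# correlations with plain pandas.
import pandas as pd

df = pd.DataFrame({
    "Open":   [295.01, 294.81, 294.85, 294.83, 294.80],
    "Close":  [294.82, 294.85, 294.85, 294.83, 294.67],
    "Volume": [5235, 7441, 4268, 4012, 13081],
})
print(df.describe())                # count/mean/std/min/quartiles/max
print(df.dtypes.value_counts())     # data-type counts, as in init_analysis
print(df.corr())                    # Pearson correlations, as in get_correlations
```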

In [5]:
edahub.init_analysis(dh.get_dataset("aapl"), "Close")
Training Data Description:
Date Open High Low Close Volume
count 2.014000e+03 2014.000000 2014.000000 2014.000000 2014.000000 2.014000e+03
mean 2.010103e+11 308.172672 308.429033 307.882731 308.178322 1.993260e+05
std 4.515289e+05 6.387164 6.400242 6.358531 6.387897 2.321854e+05
min 2.010100e+11 293.350000 293.880000 292.490000 293.400000 1.000000e+02
25% 2.010100e+11 303.645000 303.872500 303.370000 303.672500 6.464325e+04
50% 2.010100e+11 308.530000 308.765000 308.215000 308.545000 1.399070e+05
75% 2.010110e+11 311.850000 312.300000 311.577500 311.865000 2.587075e+05
max 2.010110e+11 320.020000 320.180000 319.140000 320.010000 2.934926e+06
Target Column Description:
Close
count 2014.000000
mean 308.178322
std 6.387897
min 293.400000
25% 303.672500
50% 308.545000
75% 311.865000
max 320.010000
Data Types:
0
float64 5
int64 1
object 1
In [6]:
edahub.get_correlations(dh.get_dataset("aapl"))
Out[6]:
Date Open High Low Close Volume
Date 1.000000 0.475832 0.470969 0.475330 0.476854 -0.108512
Open 0.475832 1.000000 0.997058 0.998114 0.998227 0.022302
High 0.470969 0.997058 1.000000 0.993855 0.996758 0.054850
Low 0.475330 0.998114 0.993855 1.000000 0.998216 -0.002160
Close 0.476854 0.998227 0.996758 0.998216 1.000000 0.027664
Volume -0.108512 0.022302 0.054850 -0.002160 0.027664 1.000000

d) Initialise AI Hub

In [7]:
aih = aihub()
aih.load_data(dh)
aih.datahub.get_names()
Instantiating A.I. hub
Instantiating Data Hub class
Out[7]:
['aapl']

e) Run all regression algorithms to predict the Close price for the next 500 days

Algorithms: multivariate regression, Lasso regression, Support Vector Machines (linear, polynomial)

In [8]:
warnings.simplefilter("ignore")

DAYS_PREDICT = 500

aih.run_regression(
    ["aapl", "aapl",], 
    ["Close", "Volume",],
    [DAYS_PREDICT, DAYS_PREDICT],
    algos = ["reg", "svr_linear", "svr_poly", "lasso"], 
    test_sizes = [0.2], 
    filldrop = ["fill"],
    force_encoding = "le"
)
Out[8]:
Data_Name Main_Algo Algo Target F-Out Test-Size Prediction Score MSE
0 aapl Regression reg Close 500 0.2 [319.37129078766634, 319.2237143555743, 319.05... 0.328151 17.4366
1 aapl Regression svr_linear Close 500 0.2 [319.0294674186048, 318.99549669746347, 319.01... 0.512000 12.9671
2 aapl Regression svr_poly Close 500 0.2 [318.6709009648811, 318.75161218005536, 318.53... 0.311222 17.9795
3 aapl Regression lasso Close 500 0.2 [318.32663530151973, 318.31204642611226, 318.2... 0.320270 16.9062
4 aapl Regression reg Volume 500 0.2 [85098.99564584502, 92118.48452055214, 93092.7... 0.111619 50027572237.3018
5 aapl Regression svr_linear Volume 500 0.2 [134992.92847370188, 134993.62487478842, 13499... -0.049591 51942028050.6512
6 aapl Regression svr_poly Volume 500 0.2 [136434.93696914945, 136437.63976914156, 13643... -0.041536 60819670513.8979
7 aapl Regression lasso Volume 500 0.2 [85039.73773983751, 92733.1341232874, 92880.07... 0.022520 27615024982.5716
In [9]:
dorig = aih.datahub.get_dataset("aapl")
dpred_reg = aih.datahub.get_dataset("aapl.regression.svr_linear.Close.{0}.[0.2].prediction".format(DAYS_PREDICT))
dpred_poly = aih.datahub.get_dataset("aapl.regression.svr_poly.Close.{0}.[0.2].prediction".format(DAYS_PREDICT))

edahub.plot_regression(dorig, dpred_reg, "Close", xlab = "Days", title = "SVR Linear")
In [ ]:
edahub.plot_regression(dorig, dpred_poly, "Close", xlab = "Days", title = "SVR Polynomial")

6. Prediction of Kmart sales - Classification

a) Load Kmart sales data (source: Moody's Analytics)

In [10]:
dk = datahub()
dk.add_datasets({
    "kmart.train": "d:/data/kmart/train_kmart.csv",
    "kmart.test": "d:/data/kmart/test_kmart.csv"})
display(dk.get_dataset("kmart.train").head())
Instantiating Data Hub class
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type Item_Outlet_Sales
0 FDA15 9.30 Low Fat 0.016047 Dairy 249.8092 OUT049 1999 Medium Tier 1 Supermarket Type1 3735.1380
1 DRC01 5.92 Regular 0.019278 Soft Drinks 48.2692 OUT018 2009 Medium Tier 3 Supermarket Type2 443.4228
2 FDN15 17.50 Low Fat 0.016760 Meat 141.6180 OUT049 1999 Medium Tier 1 Supermarket Type1 2097.2700
3 FDX07 19.20 Regular 0.000000 Fruits and Vegetables 182.0950 OUT010 1998 NaN Tier 3 Grocery Store 732.3800
4 NCD19 8.93 Low Fat 0.000000 Household 53.8614 OUT013 1987 High Tier 3 Supermarket Type1 994.7052
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type
0 FDW58 20.750 Low Fat 0.007565 Snack Foods 107.8622 OUT049 1999 Medium Tier 1 Supermarket Type1
1 FDW14 8.300 reg 0.038428 Dairy 87.3198 OUT017 2007 NaN Tier 2 Supermarket Type1
2 NCN55 14.600 Low Fat 0.099575 Others 241.7538 OUT010 1998 NaN Tier 3 Grocery Store
3 FDQ58 7.315 Low Fat 0.015388 Snack Foods 155.0340 OUT017 2007 NaN Tier 2 Supermarket Type1
4 FDY38 NaN Regular 0.118599 Dairy 234.2300 OUT027 1985 Medium Tier 3 Supermarket Type3
In [ ]:
display(dk.get_dataset("kmart.test").head())

b) Exploratory Data Analysis (EDA)

In [11]:
edahub.init_analysis(dk.get_dataset("kmart.train"), "Item_Outlet_Sales")
Training Data Description:
Item_Weight Item_Visibility Item_MRP Outlet_Establishment_Year Item_Outlet_Sales
count 7060.000000 8523.000000 8523.000000 8523.000000 8523.000000
mean 12.857645 0.066132 140.992782 1997.831867 2181.288914
std 4.643456 0.051598 62.275067 8.371760 1706.499616
min 4.555000 0.000000 31.290000 1985.000000 33.290000
25% 8.773750 0.026989 93.826500 1987.000000 834.247400
50% 12.600000 0.053931 143.012800 1999.000000 1794.331000
75% 16.850000 0.094585 185.643700 2004.000000 3101.296400
max 21.350000 0.328391 266.888400 2009.000000 13086.964800
Target Column Description:
Item_Outlet_Sales
count 8523.000000
mean 2181.288914
std 1706.499616
min 33.290000
25% 834.247400
50% 1794.331000
75% 3101.296400
max 13086.964800
Data Types:
0
object 7
float64 4
int64 1
In [12]:
edahub.missing_values_table(dk.get_dataset("kmart.train"))
Your selected dataframe has 12 columns.
There are 2 columns that have missing values.
Out[12]:
Missing Values % of Total Values
Outlet_Size 2410 28.3
Item_Weight 1463 17.2
In [13]:
NUM_CLASSES = 2
dk.add_datasets({
    "kmart.cut.train": auxhub.classify_split(
        dk.get_dataset("kmart.train"), "Item_Outlet_Sales", "Target_Sales", 
        method = "cut", num_classes = NUM_CLASSES),
    
    "kmart.pct.train": auxhub.classify_split(
    dk.get_dataset("kmart.train"), "Item_Outlet_Sales", "Target_Sales", 
        method = "pct", num_classes = NUM_CLASSES),    
})
In [14]:
edahub.plot_frequency(dk.get_dataset("kmart.cut.train"), "Target_Sales")
In [ ]:
edahub.plot_frequency(dk.get_dataset("kmart.pct.train"), "Target_Sales")
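auxhub.classify_split above is in-house; its "cut" and "pct" methods presumably correspond to equal-width and equal-frequency binning of the continuous target, which pandas provides directly. A sketch under that assumption (toy values, not the Kmart data):

```python
# Binning sketch: turn a continuous target into NUM_CLASSES labels,
# either by equal-width intervals (cut) or by quantiles (qcut).
import pandas as pd

NUM_CLASSES = 2
sales = pd.Series([3735.14, 443.42, 2097.27, 732.38, 994.71, 5000.0])

by_width = pd.cut(sales, bins=NUM_CLASSES, labels=False)    # equal-width
by_quantile = pd.qcut(sales, q=NUM_CLASSES, labels=False)   # equal-frequency
print(list(by_width))      # class per row, split at the range midpoint
print(list(by_quantile))   # class per row, split at the median
```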

c) Encoding categorical fields as numerical (one-hot encoding / integer encoding)
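dk.encode below presumably wraps the two standard schemes ("le" and "ohe"). For reference, both applied to a toy categorical column with scikit-learn and pandas:

```python
# Encoding sketch: the same categorical column as integer codes
# (label encoding) and as indicator columns (one-hot encoding).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

fat = pd.Series(["Low Fat", "Regular", "Low Fat"], name="Item_Fat_Content")

codes = LabelEncoder().fit_transform(fat)   # categories sorted, then numbered
dummies = pd.get_dummies(fat)               # one 0/1 column per category
print(list(codes))                          # [0, 1, 0]
print(list(dummies.columns))                # ['Low Fat', 'Regular']
```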

In [15]:
dk.encode([('kmart.cut.train', "Target_Sales"), ('kmart.pct.train', "Target_Sales")], 
         ['kmart.test'], target_cols = ['Item_Outlet_Sales', 'Target_Sales'], encoding_arr = ["le", "ohe"])
Out[15]:
[('kmart.cut.train.Target_Sales.le.features',
  'kmart.test.Target_Sales.le.features',
  'kmart.cut.train.Target_Sales.le.labels'),
 ('kmart.cut.train.Target_Sales.ohe.features',
  'kmart.test.Target_Sales.ohe.features',
  'kmart.cut.train.Target_Sales.ohe.labels'),
 ('kmart.pct.train.Target_Sales.le.features',
  'kmart.test.Target_Sales.le.features',
  'kmart.pct.train.Target_Sales.le.labels'),
 ('kmart.pct.train.Target_Sales.ohe.features',
  'kmart.test.Target_Sales.ohe.features',
  'kmart.pct.train.Target_Sales.ohe.labels')]
In [16]:
# one-hot encoding
dk.get_dataset("kmart.pct.train.Target_Sales.ohe.features").head()
Out[16]:
Item_Weight Item_Visibility Item_MRP Outlet_Establishment_Year Item_Identifier_DRA12 Item_Identifier_DRA24 Item_Identifier_DRA59 Item_Identifier_DRB01 Item_Identifier_DRB13 Item_Identifier_DRB24 ... Outlet_Size_High Outlet_Size_Medium Outlet_Size_Small Outlet_Location_Type_Tier 1 Outlet_Location_Type_Tier 2 Outlet_Location_Type_Tier 3 Outlet_Type_Grocery Store Outlet_Type_Supermarket Type1 Outlet_Type_Supermarket Type2 Outlet_Type_Supermarket Type3
0 9.30 0.016047 249.8092 1999 0 0 0 0 0 0 ... 0 1 0 1 0 0 0 1 0 0
1 5.92 0.019278 48.2692 2009 0 0 0 0 0 0 ... 0 1 0 0 0 1 0 0 1 0
2 17.50 0.016760 141.6180 1999 0 0 0 0 0 0 ... 0 1 0 1 0 0 0 1 0 0
3 19.20 0.000000 182.0950 1998 0 0 0 0 0 0 ... 0 0 0 0 0 1 1 0 0 0
4 8.93 0.000000 53.8614 1987 0 0 0 0 0 0 ... 1 0 0 0 0 1 0 1 0 0

5 rows × 1588 columns

In [17]:
# integer encoding
dk.get_dataset("kmart.pct.train.Target_Sales.le.features").head()
Out[17]:
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type
0 156 9.30 1 0.016047 4 249.8092 9 1999 1 0 1
1 8 5.92 2 0.019278 14 48.2692 3 2009 1 2 2
2 662 17.50 1 0.016760 10 141.6180 9 1999 1 0 1
3 1121 19.20 2 0.000000 6 182.0950 0 1998 3 2 0
4 1297 8.93 1 0.000000 9 53.8614 1 1987 0 2 1

d) Data Transformation - Feature Scaling & Normalisation

  • Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalisation and is generally performed during the data preprocessing step.

  • Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalisation.
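The run_imputer_scaler calls below are in-house; the 0-to-1 ranges in the resulting arrays suggest mean imputation followed by min-max scaling. A sketch of that combination with scikit-learn (an assumption about the hub's internals, not its actual code):

```python
# Imputation + scaling sketch: fill missing values with the column
# mean, then rescale each feature to the [0, 1] range.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X = np.array([[9.30, 249.81],
              [np.nan, 48.27],
              [17.50, 141.62]])

pipe = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler())
print(pipe.fit_transform(X))   # NaN filled, every column now spans 0..1
```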

In [18]:
# one-hot encoded features, labels, and test data to predict
d2 = dk
ohe_features = [x for x in d2.get_names() if "features" in x and "train" in x and ".ohe." in x and not "full" in x]
ohe_labels = [x for x in d2.get_names() if "labels" in x and ".ohe." in x]
ohe_test_features = [x for x in d2.get_names() if "features" in x and "test" in x and ".ohe." in x and not "full" in x]
ohe_zlf = list(zip(ohe_features, ohe_labels))

# integer encoded features, labels, and test data to predict
le_features = [x for x in d2.get_names() if "features" in x and "train" in x and ".le." in x and not "full" in x]
le_labels = [x for x in d2.get_names() if "labels" in x and ".le." in x]
le_test_features = [x for x in d2.get_names() if "features" in x and "test" in x and ".le." in x and not "full" in x]
le_zlf = list(zip(le_features, le_labels))
In [19]:
dk.run_imputer_scaler(ohe_zlf, ohe_test_features)
Out[19]:
[('kmart.cut.train.Target_Sales.ohe.features.impsca',
  'kmart.test.Target_Sales.ohe.features.impsca'),
 ('kmart.pct.train.Target_Sales.ohe.features.impsca',
  'kmart.test.Target_Sales.ohe.features.impsca')]
In [20]:
dk.run_imputer_scaler(le_zlf, le_test_features)
Out[20]:
[('kmart.cut.train.Target_Sales.le.features.impsca',
  'kmart.test.Target_Sales.le.features.impsca'),
 ('kmart.pct.train.Target_Sales.le.features.impsca',
  'kmart.test.Target_Sales.le.features.impsca')]

e) Continuing with EDA

Kernel Density Estimation plots

Kernel density estimation (KDE) is a non-parametric method of estimating the probability density function (PDF) of a continuous random variable: non-parametric because it does not assume any underlying distribution for the variable. Here it is used to relate each field in question to the target column.
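plot_density_all below presumably draws one density estimate per target class. The underlying computation can be sketched with scipy (synthetic feature values, illustrative class means):

```python
# KDE sketch: estimate the density of a feature separately for each
# target class, mirroring what a per-class density plot shows.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
feature = np.concatenate([rng.normal(100, 10, 200),   # class 0
                          rng.normal(180, 15, 200)])  # class 1
target = np.repeat([0, 1], 200)

grid = np.linspace(feature.min(), feature.max(), 50)
for cls in (0, 1):
    density = gaussian_kde(feature[target == cls])(grid)
    print(f"class {cls}: density peaks near {grid[density.argmax()]:.0f}")
```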

In [21]:
font = {'family' : 'sans-serif',  # 'normal' is not a valid font family
        'weight' : 'regular',
        'size'   : 10}

matplotlib.rc('font', **font)

df1 = dk.get_dataset('kmart.pct.train.Target_Sales.le.full.features')
edahub.plot_density_all(df1, "Target_Sales", NUM_CLASSES)
Plotting KDE of Item_Identifier
Plotting KDE of Item_Weight
Plotting KDE of Item_Fat_Content
Plotting KDE of Item_Visibility
Plotting KDE of Item_Type
Plotting KDE of Item_MRP
Plotting KDE of Outlet_Identifier
Plotting KDE of Outlet_Establishment_Year
Plotting KDE of Outlet_Size
Plotting KDE of Outlet_Location_Type
Plotting KDE of Outlet_Type
Plotting KDE of Item_Outlet_Sales
In [22]:
edahub.plot_density_all(df1, "Item_Outlet_Sales", NUM_CLASSES)
Plotting KDE of Item_Identifier
Plotting KDE of Item_Weight
Plotting KDE of Item_Fat_Content
Plotting KDE of Item_Visibility
Plotting KDE of Item_Type
Plotting KDE of Item_MRP
Plotting KDE of Outlet_Identifier
Plotting KDE of Outlet_Establishment_Year
Plotting KDE of Outlet_Size
Plotting KDE of Outlet_Location_Type
Plotting KDE of Outlet_Type
Plotting KDE of Target_Sales

Full correlation heatmap

In [23]:
groups = ["Item", "Outlet"]
data_sets = auxhub.seperate_data(df1, groups)
edahub.corr_heatmap(df1)
In [ ]:
edahub.corr_heatmap(data_sets["Item"])
In [ ]:
edahub.corr_heatmap(data_sets["Outlet"])

Pairs Plot (KDE)

In [24]:
item_data = pd.concat([ data_sets['Item'], df1[[ "Item_Outlet_Sales", "Target_Sales" ]] ], axis = 1)
edahub.pairs_plot(item_data, "Item_Outlet_Sales", "Target_Sales")
In [25]:
outlet_data = pd.concat([ data_sets['Outlet'], df1[[ "Item_Outlet_Sales", "Target_Sales" ]] ], axis = 1)
edahub.pairs_plot(outlet_data, "Item_Outlet_Sales", "Target_Sales")
In [26]:
df2 = dk.get_dataset('kmart.pct.train.Target_Sales.le.features.impsca')
edahub.elbow_curve(df2, algo = "kmeans", max_c = 35)
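edahub.elbow_curve presumably plots k-means inertia against the number of clusters; the underlying computation is simply (a sketch on synthetic data):

```python
# Elbow-method sketch: run k-means for increasing k and record the
# inertia (within-cluster sum of squares); the "elbow" where the curve
# flattens suggests a reasonable number of clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)]
print([round(i) for i in inertias])   # should drop sharply, then flatten
```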

f) Predicting Target Sales

In [27]:
aih2 = aihub()
aih2.load_data(dk)

warnings.simplefilter("ignore")

aih2.run_classifier(
    ["kmart.pct.train.Target_Sales.le.features.impsca","kmart.pct.train.Target_Sales.ohe.features.impsca"],
    ["kmart.test.Target_Sales.le.features.impsca","kmart.test.Target_Sales.ohe.features.impsca"],
    ["kmart.pct.train.Target_Sales.le.labels","kmart.pct.train.Target_Sales.ohe.labels"],
    algos = ["dtree", "rforest", "gauss", "quad"]
)
Instantiating A.I. hub
Instantiating Data Hub class
Out[27]:
Train_Name Test_Name Labels_Name Main_Algo Algo Valid-Prediction Valid-Score Valid-Z-Score Test-Prediction Z-Score
0 kmart.pct.train.Target_Sales.le.features.impsca kmart.test.Target_Sales.le.features.impsca kmart.pct.train.Target_Sales.le.labels Classifier dtree [1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, ... 0.826235 [[0.13391399020141534, 0.8660860097985846], [0... [0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, ... [[0.6203576341127923, 0.3796423658872077], [0....
1 kmart.pct.train.Target_Sales.le.features.impsca kmart.test.Target_Sales.le.features.impsca kmart.pct.train.Target_Sales.le.labels Classifier rforest [1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, ... 0.805702 [[0.36180118798049243, 0.6381988120195076], [0... [0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, ... [[0.5200863045739511, 0.4799136954260489], [0....
2 kmart.pct.train.Target_Sales.le.features.impsca kmart.test.Target_Sales.le.features.impsca kmart.pct.train.Target_Sales.le.labels Classifier gauss [1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, ... 0.771911 [[0.10291375993623142, 0.8970862400637686], [0... [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, ... [[0.5750205163681031, 0.424979483631897], [0.9...
3 kmart.pct.train.Target_Sales.le.features.impsca kmart.test.Target_Sales.le.features.impsca kmart.pct.train.Target_Sales.le.labels Classifier quad [1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, ... 0.796551 [2.9544029193262658, -4.424600983966915, 0.545... [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, ... [-0.28390905315725945, -1.1809279322872888, -6...
4 kmart.pct.train.Target_Sales.ohe.features.impsca kmart.test.Target_Sales.ohe.features.impsca kmart.pct.train.Target_Sales.ohe.labels Classifier dtree [1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, ... 0.826235 [[0.11871708951651508, 0.881282910483485], [0.... [0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, ... [[0.6046153846153847, 0.3953846153846154], [0....
5 kmart.pct.train.Target_Sales.ohe.features.impsca kmart.test.Target_Sales.ohe.features.impsca kmart.pct.train.Target_Sales.ohe.labels Classifier rforest [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ... 0.509797 [[0.49841680951748024, 0.5015831904825198], [0... [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, ... [[0.49841680951748024, 0.5015831904825198], [0...
6 kmart.pct.train.Target_Sales.ohe.features.impsca kmart.test.Target_Sales.ohe.features.impsca kmart.pct.train.Target_Sales.ohe.labels Classifier gauss [1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, ... 0.818960 [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0... [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, ... [[1.0, 0.0], [1.0, 2.721762061667525e-141], [1...
7 kmart.pct.train.Target_Sales.ohe.features.impsca kmart.test.Target_Sales.ohe.features.impsca kmart.pct.train.Target_Sales.ohe.labels Classifier quad [1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, ... 0.772733 [72087.25859727155, -1.8619847846491892e+32, 5... [0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, ... [-9.52041644091836e+32, 3398.379479481703, 7.0...
In [28]:
pred = aih2.datahub.get_dataset('kmart.pct.train.Target_Sales.ohe.features.impsca.classifier.dtree.prediction')
zscore = aih2.datahub.get_dataset('kmart.pct.train.Target_Sales.ohe.features.impsca.classifier.dtree.Z-score')
new_df = aih2.datahub.get_dataset('kmart.test')
pred = pred[list(pred.columns)[0]]
new_df['Target_Sales_Prediction'] = pred
new_df['Z-Score-0'] = zscore[0]
new_df['Z-Score-1'] = zscore[1]
new_df.head(20)
Out[28]:
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type Target_Sales_Prediction Z-Score-0 Z-Score-1
0 FDW58 20.750 Low Fat 0.007565 Snack Foods 107.8622 OUT049 1999 Medium Tier 1 Supermarket Type1 0 0.604615 0.395385
1 FDW14 8.300 reg 0.038428 Dairy 87.3198 OUT017 2007 NaN Tier 2 Supermarket Type1 0 0.788187 0.211813
2 NCN55 14.600 Low Fat 0.099575 Others 241.7538 OUT010 1998 NaN Tier 3 Grocery Store 0 1.000000 0.000000
3 FDQ58 7.315 Low Fat 0.015388 Snack Foods 155.0340 OUT017 2007 NaN Tier 2 Supermarket Type1 1 0.257948 0.742052
4 FDY38 NaN Regular 0.118599 Dairy 234.2300 OUT027 1985 Medium Tier 3 Supermarket Type3 1 0.118717 0.881283
5 FDH56 9.800 Regular 0.063817 Fruits and Vegetables 117.1492 OUT046 1997 Small Tier 1 Supermarket Type1 1 0.420079 0.579921
6 FDL48 19.350 Regular 0.082602 Baking Goods 50.1034 OUT018 2009 Medium Tier 3 Supermarket Type2 0 0.991533 0.008467
7 FDC48 NaN Low Fat 0.015782 Baking Goods 81.0592 OUT027 1985 Medium Tier 3 Supermarket Type3 1 0.366667 0.633333
8 FDN33 6.305 Regular 0.123365 Snack Foods 95.7436 OUT045 2002 NaN Tier 2 Supermarket Type1 0 0.604615 0.395385
9 FDA36 5.985 Low Fat 0.005698 Baking Goods 186.8924 OUT017 2007 NaN Tier 2 Supermarket Type1 1 0.118717 0.881283
10 FDT44 16.600 Low Fat 0.103569 Fruits and Vegetables 118.3466 OUT017 2007 NaN Tier 2 Supermarket Type1 1 0.420079 0.579921
11 FDQ56 6.590 Low Fat 0.105811 Fruits and Vegetables 85.3908 OUT045 2002 NaN Tier 2 Supermarket Type1 0 0.788187 0.211813
12 NCC54 NaN Low Fat 0.171079 Health and Hygiene 240.4196 OUT019 1985 Small Tier 1 Grocery Store 0 1.000000 0.000000
13 FDU11 4.785 Low Fat 0.092738 Breads 122.3098 OUT049 1999 Medium Tier 1 Supermarket Type1 1 0.420079 0.579921
14 DRL59 16.750 LF 0.021206 Hard Drinks 52.0298 OUT013 1987 High Tier 3 Supermarket Type1 0 0.991533 0.008467
15 FDM24 6.135 Regular 0.079451 Baking Goods 151.6366 OUT049 1999 Medium Tier 1 Supermarket Type1 1 0.257948 0.742052
16 FDI57 19.850 Low Fat 0.054135 Seafood 198.7768 OUT045 2002 NaN Tier 2 Supermarket Type1 1 0.118717 0.881283
17 DRC12 17.850 Low Fat 0.037981 Soft Drinks 192.2188 OUT018 2009 Medium Tier 3 Supermarket Type2 1 0.118717 0.881283
18 NCM42 NaN Low Fat 0.028184 Household 109.6912 OUT027 1985 Medium Tier 3 Supermarket Type3 1 0.126761 0.873239
19 FDA46 13.600 Low Fat 0.196898 Snack Foods 193.7136 OUT010 1998 NaN Tier 3 Grocery Store 0 1.000000 0.000000
In [29]:
df00 = aih2.datahub.get_dataset('kmart.pct.train.Target_Sales.le.features')
df02 = aih2.datahub.get_dataset('kmart.pct.train.Target_Sales.le.features.impsca')
df03 = aih2.datahub.get_dataset('kmart.pct.train.Target_Sales.le.labels')
edahub.plot_classifiers(df02, df03, list(df00.columns), "Item_Visibility", "Item_MRP", NUM_CLASSES)
kneigh
svc_linear
svc_poly
dtree
rforest
mlp
ada
gauss
quad

g) Predicting target sales (focussing on Item_MRP)

In [30]:
aih2.datahub.get_dataset("kmart.pct.train.Target_Sales.le.features").head()
Out[30]:
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type
0 156 9.30 1 0.016047 4 249.8092 9 1999 1 0 1
1 8 5.92 2 0.019278 14 48.2692 3 2009 1 2 2
2 662 17.50 1 0.016760 10 141.6180 9 1999 1 0 1
3 1121 19.20 2 0.000000 6 182.0950 0 1998 3 2 0
4 1297 8.93 1 0.000000 9 53.8614 1 1987 0 2 1
In [31]:
cols = list(aih2.datahub.get_dataset("kmart.pct.train.Target_Sales.le.features").columns)
cols
Out[31]:
['Item_Identifier',
 'Item_Weight',
 'Item_Fat_Content',
 'Item_Visibility',
 'Item_Type',
 'Item_MRP',
 'Outlet_Identifier',
 'Outlet_Establishment_Year',
 'Outlet_Size',
 'Outlet_Location_Type',
 'Outlet_Type']
In [32]:
aih2.datahub.get_dataset("kmart.pct.train.Target_Sales.le.features.impsca")
Out[32]:
array([[0.10012837, 0.28252456, 0.25      , ..., 0.33333333, 0.        ,
        0.33333333],
       [0.00513479, 0.08127419, 0.5       , ..., 0.33333333, 1.        ,
        0.66666667],
       [0.42490372, 0.77076511, 0.25      , ..., 0.33333333, 0.        ,
        0.33333333],
       ...,
       [0.87098845, 0.35992855, 0.25      , ..., 0.66666667, 0.5       ,
        0.33333333],
       [0.43709884, 0.15808276, 0.5       , ..., 0.33333333, 1.        ,
        0.66666667],
       [0.03209243, 0.61000298, 0.25      , ..., 0.66666667, 0.        ,
        0.33333333]])
In [33]:
aih2.datahub.get_dataset("kmart.test.Target_Sales.le.features.impsca")
Out[33]:
array([[0.71501926, 0.96427508, 0.25      , ..., 0.33333333, 0.        ,
        0.33333333],
       [0.69191271, 0.22298303, 1.        , ..., 1.        , 0.5       ,
        0.33333333],
       [0.9114249 , 0.59809467, 0.25      , ..., 1.        , 1.        ,
        0.        ],
       ...,
       [0.91527599, 0.32420363, 0.25      , ..., 1.        , 0.5       ,
        0.33333333],
       [0.33440308, 0.63977374, 0.5       , ..., 1.        , 0.5       ,
        0.33333333],
       [0.63992298, 0.29443287, 0.5       , ..., 1.        , 0.5       ,
        0.33333333]])
In [46]:
select_cols = ["Item_MRP", "Outlet_Establishment_Year"]
#select_cols = ["Item_MRP"]
item_mrp_idxs = auxhub.get_indices(cols, select_cols)
In [47]:
aih3 = aihub()
dk2 = datahub()
dk2.add_dataset("k_label", aih2.datahub.get_dataset("kmart.pct.train.Target_Sales.le.labels"))

dk2.add_dataset("k_test", aih2.datahub.get_dataset("kmart.test"))

dk2.add_dataset("k_test_impsca", aih2.datahub.get_dataset("kmart.test.Target_Sales.le.features.impsca")[:, item_mrp_idxs])

dk2.add_dataset("k_mrp_impsca", aih2.datahub.get_dataset(
    "kmart.pct.train.Target_Sales.le.features.impsca")[:, item_mrp_idxs])

aih3.load_data(dk2)
Instantiating A.I. hub
Instantiating Data Hub class
Instantiating Data Hub class
In [48]:
aih3.datahub.get_dataset("k_test_impsca")
Out[48]:
array([[0.32501155, 0.58333333],
       [0.2378191 , 0.91666667],
       [0.89331591, 0.54166667],
       ...,
       [0.37119946, 0.70833333],
       [0.77815384, 0.91666667],
       [0.20588425, 0.70833333]])
In [49]:
warnings.simplefilter("ignore")

aih3.run_classifier(
    ["k_mrp_impsca"],
    ["k_test_impsca"],
    ["k_label"],
    algos = ["dtree", "rforest", "gauss", "quad"]
)
Out[49]:
Train_Name Test_Name Labels_Name Main_Algo Algo Valid-Prediction Valid-Score Valid-Z-Score Test-Prediction Z-Score
0 k_mrp_impsca k_test_impsca k_label Classifier dtree [1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, ... 0.773085 [[0.10606060606060606, 0.8939393939393939], [0... [0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, ... [[0.6759098786828422, 0.3240901213171577], [0....
1 k_mrp_impsca k_test_impsca k_label Classifier rforest [1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, ... 0.777660 [[0.15475611211214474, 0.8452438878878551], [0... [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, ... [[0.6809973715039934, 0.31900262849600663], [0...
2 k_mrp_impsca k_test_impsca k_label Classifier gauss [1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, ... 0.724745 [[0.1262849178833999, 0.8737150821166], [0.908... [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, ... [[0.6758412999806085, 0.3241587000193915], [0....
3 k_mrp_impsca k_test_impsca k_label Classifier quad [1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, ... 0.725214 [1.9998141185399252, -2.8058384194154797, 0.05... [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, ... [-0.7614755176220629, -1.4645217178983563, 1.8...
In [50]:
aih3.datahub.get_names()
Out[50]:
['k_label',
 'k_test',
 'k_test_impsca',
 'k_mrp_impsca',
 'k_mrp_impsca.classifier.dtree.valid-prediction',
 'k_mrp_impsca.classifier.dtree.score',
 'k_mrp_impsca.classifier.dtree.valid-Z-score',
 'k_mrp_impsca.classifier.dtree.prediction',
 'k_mrp_impsca.classifier.dtree.Z-score',
 'k_mrp_impsca.classifier.rforest.valid-prediction',
 'k_mrp_impsca.classifier.rforest.score',
 'k_mrp_impsca.classifier.rforest.valid-Z-score',
 'k_mrp_impsca.classifier.rforest.prediction',
 'k_mrp_impsca.classifier.rforest.Z-score',
 'k_mrp_impsca.classifier.gauss.valid-prediction',
 'k_mrp_impsca.classifier.gauss.score',
 'k_mrp_impsca.classifier.gauss.valid-Z-score',
 'k_mrp_impsca.classifier.gauss.prediction',
 'k_mrp_impsca.classifier.gauss.Z-score',
 'k_mrp_impsca.classifier.quad.valid-prediction',
 'k_mrp_impsca.classifier.quad.score',
 'k_mrp_impsca.classifier.quad.valid-Z-score',
 'k_mrp_impsca.classifier.quad.prediction',
 'k_mrp_impsca.classifier.quad.Z-score',
 'classifier.results']
In [51]:
pred2 = aih3.datahub.get_dataset('k_mrp_impsca.classifier.rforest.prediction')
zscore2 = aih3.datahub.get_dataset('k_mrp_impsca.classifier.rforest.valid-Z-score')
new_df2 = pd.DataFrame(aih3.datahub.get_dataset('k_test'), columns = select_cols)
pred2 = pred2[list(pred2.columns)[0]]
new_df2['Target_Sales_Prediction'] = pred2
new_df2['Z-Score-0'] = zscore2[0]
new_df2['Z-Score-1'] = zscore2[1]
new_df2.head(20)
Out[51]:
Item_MRP Outlet_Establishment_Year Target_Sales_Prediction Z-Score-0 Z-Score-1
0 107.8622 1999 0 0.154756 0.845244
1 87.3198 2007 0 0.996444 0.003556
2 241.7538 1998 0 0.380097 0.619903
3 155.0340 2007 1 0.881547 0.118453
4 234.2300 1985 1 0.960593 0.039407
5 117.1492 1997 1 0.996444 0.003556
6 50.1034 2009 0 0.960593 0.039407
7 81.0592 1985 0 0.500804 0.499196
8 95.7436 2002 0 0.701285 0.298715
9 186.8924 2007 1 0.147392 0.852608
10 118.3466 2007 1 0.960814 0.039186
11 85.3908 2002 0 0.318835 0.681165
12 240.4196 1985 1 0.308493 0.691507
13 122.3098 1999 1 0.414336 0.585664
14 52.0298 1987 0 0.286727 0.713273
15 151.6366 1999 1 0.957414 0.042586
16 198.7768 2002 1 0.671764 0.328236
17 192.2188 2009 1 0.959809 0.040191
18 109.6912 1985 1 0.459970 0.540030
19 193.7136 1998 0 0.128822 0.871178